Michael Regenscheit, University of Konstanz, michael.regenscheit@uni-konstanz.de
Christian Scheible, University of Konstanz, christian.scheible@uni-konstanz.de
Thomas Ramm, University of Konstanz, thomas.ramm@uni-konstanz.de
KNIME: Konstanz Information Miner (http://www.knime.org/) is a user-friendly and
comprehensive open-source data integration, processing, analysis, and
exploration platform.
Jigsaw: (http://www.cc.gatech.edu/gvu/ii/jigsaw/index.html)
is a visual analytics system that enables analysts and researchers to explore,
analyze, and make sense of document collections.
KIAWordCloudVis:
The Konstanz Intelligence Agency Word Cloud Visualization is a full text search
and visualization tool providing a fast overview of a document collection. We
developed it for the VAST Challenge 2011 making use of Apache Lucene for
indexing and the IBM Word-Cloud Generator for the word clouds.
Apache Lucene: (http://lucene.apache.org/java/docs/index.html)
is a high-performance, full-featured text search
engine library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, in particular for
cross-platform applications.
Word-Cloud Generator: (http://www.alphaworks.ibm.com/tech/wordcloud)
is a Java application that creates word clouds from any source text. It's built
on the same technology that powers the popular "Wordle" web
application.
NER:
Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml)
is a Java implementation of a Named Entity Recognizer. Named Entity Recognition
(NER) labels sequences of words in a text which are proper names, such as
person and company names, or gene and protein names.
Mallet:
MAchine Learning for LanguagE Toolkit (http://mallet.cs.umass.edu/topics.php)
Topic models provide a simple way to analyze large volumes of unlabeled text. A
"topic" consists of a cluster of words that frequently occur
together. Using contextual clues, topic models can connect words with similar
meanings and distinguish between uses of words with multiple meanings. (We used
this tool, but none of the resulting topics fit our task, so it is not
mentioned in the text.)
ANSWERS:
MC 3.1 Potential Threats: Identify any imminent terrorist threats in the Vastopolis metropolitan
area. Provide detailed information on the threat or threats (e.g. who, what,
where, when, and how) so that officials can conduct counterintelligence
activities. Also, provide a list of the evidential documents supporting your
answer.
To find the relevant documents, we used a combination of tools and methods, as
shown in Figure 1. As a first step, we retrieved a list of terrorism keywords
(http://www.myvocabulary.com/index.php?dir=wordlist&file=word_list&wordlist_id=197)
and spent about 20 minutes manually reviewing and adapting it. We used this list
to rank the articles by relevance (the number of terror words occurring in the
text) and manually scanned the top results to obtain training examples for two
classifiers in KNIME. We merged the results of the two classifiers and added
further texts containing at least 3 terror words. This gave us a subset of 826
of the 4474 documents, containing texts that are very likely to report on
terrorist activities. Together, these steps took us approximately nine
man-hours.
Figure 1: This diagram shows all steps of the analytic process. Blue indicates automatic preprocessing steps, red indicates manual steps, and orange indicates interactive visual analyses based on different tools.
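The keyword-based ranking and filtering described above can be sketched as follows. This is a minimal illustration and not our actual KNIME workflow: the function names, the tokenization, and the choice to count every occurrence of a keyword (rather than distinct keywords) are assumptions made for the example, and the classifier output is simply passed in as a set of document IDs.

```python
import re

# Assumption: tokens are lowercase alphabetic runs; the real preprocessing may differ.
TOKEN_RE = re.compile(r"[a-z]+")


def count_keyword_hits(text, keywords):
    """Count the tokens of `text` that appear in `keywords` (case-insensitive)."""
    tokens = TOKEN_RE.findall(text.lower())
    return sum(1 for token in tokens if token in keywords)


def rank_by_keyword_hits(documents, keywords):
    """Return (doc_id, hits) pairs, highest hit count first, ties broken by ID."""
    scored = [
        (doc_id, count_keyword_hits(text, keywords))
        for doc_id, text in documents.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def select_candidates(documents, keywords, classifier_hits=(), min_hits=3):
    """Combine classifier-selected IDs with documents having >= `min_hits` hits.

    `classifier_hits` stands in for the output of the two classifiers;
    training them is outside the scope of this sketch.
    """
    by_keywords = {
        doc_id
        for doc_id, hits in rank_by_keyword_hits(documents, keywords)
        if hits >= min_hits
    }
    return set(classifier_hits) | by_keywords
```

The ranked list is what would be scanned manually to pick training examples, while `select_candidates` mirrors the final step of adding texts with at least 3 terror words to the classifier results.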
To support interactive exploration of the documents, we developed
KIAWordCloudVis, a full-text search and visualization tool (Figure 2 and Figure 3),
in four man-days. Using this tool, we performed an iterative search of the whole
corpus on a powerwall (a huge, high-resolution screen) to detect interesting
documents we had not yet discovered (about 15 man-hours). By combining the
strengths of Lucene, the Stanford NER, a Porter stemmer, and IBM's Word-Cloud
Generator, the tool is a full-text search engine that makes it possible to
identify highly relevant texts at a glance. It gave us entry points for deeper
investigations with KIAWordCloudVis as well as with Jigsaw (Figure 3). In the
example below (Figure 2), the interactive visual search with the highlighted
terror words (BOMB, SCARE) enabled us to instantly discover a text (274) about
a bomb threat in Vastopolis that we had not been aware of before. Next, we show
how we used this particular result for further investigation with Jigsaw
(about 15 man-hours).
Figure 2: Screenshot of KIAWordCloudVis with the simple boolean search string bomb
AND vastopolis. The marked text (274) is one of the evidence texts we used for
further investigation in Jigsaw. The bottom right corner shows a more detailed
view of the same text with a text-summarization word cloud.
Figure 3: Screenshot of KIAWordCloudVis. The upper left shows word clouds representing days; the upper right shows a more detailed version of these day clouds. The lower half shows word clouds for single articles, again with the detailed version on the right. The search field at the very top contains a complex search string. The selected text, which is in our result set, is about intercepted communication from the Network of Dread.
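To illustrate the kind of boolean query shown in Figure 2 (bomb AND vastopolis), the sketch below builds a tiny inverted index and intersects the posting sets of the query terms. It is a plain-Python stand-in for the idea only: KIAWordCloudVis uses Apache Lucene for indexing, and the function names, the tokenization, and the restriction to AND-only queries are assumptions made for this example. Stemming is omitted.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Evaluate a query of the form 'term1 AND term2 AND ...'.

    Only the uppercase AND operator and single-word operands are
    supported in this sketch.
    """
    operands = [
        part.strip().lower()
        for part in re.split(r"\s+AND\s+", query.strip())
        if part.strip()
    ]
    if not operands:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(operands[0], set()))
    for term in operands[1:]:
        result &= index.get(term, set())
    return result
```

With a toy collection, `search_and(build_index(docs), "bomb AND vastopolis")` returns only the IDs of documents that contain both words, which is the behavior the search field relies on to narrow the corpus to a few entry points.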
Jigsaw offers several ways to analyze a corpus. It contains clustering
algorithms as well as entity recognition methods and visualizations such as
parallel coordinates, graphs, and scatterplots. We used the Stanford NER for
entity extraction instead of Jigsaw’s method to keep the results consistent
with KIAWordCloudVis and to improve the entity recognition. To improve the
clustering results, we also reduced the number of documents loaded into Jigsaw
to the candidates found with the methods described in the first paragraph,
rather than the whole document space. With these steps we obtained two clusters
containing almost exclusively relevant texts. We also obtained a cluster about
the Antarctica Airline crash, which contains many texts that mention terror but
are not important for our task.
Figure 4: Screenshot showing Jigsaw's cluster and document views. We found the cluster with the highlighted text on the left by selecting the relevant document we found with KIAWordCloudVis (274) in the List View on the right. As you can see, the cluster contains 8 of 23 evidence documents (highlighted in the list on the left).
Performing the outlined analytical steps led us to several findings (Figure 5)
pointing to potential threats. Our first finding is an attack with a dirty bomb
by the Network of Dread, an overseas terror group. There are several hints of
this attack, starting with intercepted communication indicating attacks across
the country. One day later, threatening emails were sent to VastPress. The
group might have tried to obtain radioactive material by ship, because
radioactive cargo was found at Vastopolis harbor. Several days later, a plot to
detonate a dirty bomb in an American city was revealed. This shows that the
threat is serious.
The second scenario concerns bioterrorism and involves two different terror
groups. The first hint is that the molecular biologist Prof. Patino gave a
talk on new dangers of bioterrorism, saying that it is much easier today “to
engineer dangerous microbes with the right equipment”. On April 18th, the CDC
published an article saying that food poisoning would be an easy way to spread
a disease. Some biological equipment that could be used to cultivate bacteria
was also stolen. It may have been this equipment that was found when members of
PoC who were building a laboratory were arrested. But this does not rule out
the possibility that they are going to contaminate food, because on May 15th
two members of PoC were trespassing near a loading dock of a food preparation
plant. So it is likely that they are planning an attack on food.
The other group that might carry out an attack is the Citizens for the Ethical
Treatment of Lab Mice. But this scenario is unlikely because the evidence found
sounds rather harmless (including incidents such as trashing Prof. Patino’s
garage and screaming at his neighbors).
Our third scenario summarizes miscellaneous attacks in Vastopolis. Weapons were
stolen from the armed forces, and some of them were found in the car of a
Network of Hate member. So it is possible that this group planned an assault
with military weapons. Another group, called the Psycho Brotherhood, is trying
to build bombs and might want to detonate them in Vastopolis.
Figure 5: Timeline of events showing three possible threats for Vastopolis. The first is an attack with a (nuclear) dirty bomb, the second is a bioterrorist assault, and the last combines miscellaneous threats.
Evidential documents: 8, 62, 129, 274, 383, 499, 1088, 1671, 1691, 1785, 1878, 2287, 2395, 3040, 3212, 3229, 3231, 3232, 3435, 3563, 4080